116 research outputs found

    Multilingual Sentence Categorization according to Language

    In this paper, we describe an approach to sentence categorization whose originality is that it relies on natural properties of languages, with no dependency on a training set. The implementation is fast, small, robust, and tolerant of textual errors. Tested on discrimination among French, English, Spanish, and German, the system gives very interesting results, achieving 99.4% correct assignments on real sentences in one test. Its resolution power is based on grammatical words (not the most common words) and on the alphabet. Given the grammatical words and the alphabet of each language, the system computes for each language its likelihood of being selected. The name of the language with the highest likelihood tags the sentence, but unresolved ambiguities are maintained. We discuss the reasons that led us to use these linguistic facts and present several directions for improving the system's classification performance. Categorizing sentences by their linguistic properties shows that difficult problems sometimes have simple solutions.
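
    The idea in the abstract above (score each language from its grammatical words and its alphabet, tag the sentence with the best-scoring language, and keep unresolved ties) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny word lists, the letter sets, the names `GRAMMATICAL_WORDS`, `SPECIAL_LETTERS`, `score`, and `categorize`, and the plain hit count that stands in for the paper's likelihood, which the abstract does not specify.

```python
# Hypothetical, deliberately tiny resources: NOT the lists used in the paper.
GRAMMATICAL_WORDS = {
    "french": {"le", "la", "les", "des", "et", "est", "dans", "que", "une", "un"},
    "english": {"the", "of", "and", "is", "in", "that", "a", "to", "with"},
    "spanish": {"el", "la", "los", "las", "y", "es", "en", "que", "una", "un"},
    "german": {"der", "die", "das", "und", "ist", "in", "dass", "ein", "eine"},
}
# Letters outside the basic Latin alphabet that hint at a language (assumed sets).
SPECIAL_LETTERS = {
    "french": set("àâçéèêëîïôùûœ"),
    "english": set(),
    "spanish": set("áéíóúñü¿¡"),
    "german": set("äöüß"),
}


def score(sentence, language):
    """Count grammatical-word hits plus special-letter hits (a stand-in score)."""
    lowered = sentence.lower()
    tokens = [w.strip(".,;:!?\"'()") for w in lowered.split()]
    word_hits = sum(1 for t in tokens if t in GRAMMATICAL_WORDS[language])
    letter_hits = sum(1 for ch in lowered if ch in SPECIAL_LETTERS[language])
    return word_hits + letter_hits


def categorize(sentence):
    """Return every language tied for the best score; empty list if no evidence."""
    scores = {lang: score(sentence, lang) for lang in GRAMMATICAL_WORDS}
    best = max(scores.values())
    if best == 0:
        return []
    return sorted(lang for lang, s in scores.items() if s == best)
```

    Returning a list rather than a single label is one way to keep ambiguities visible: with these toy lists, `categorize("la casa")` yields both French and Spanish, because "la" is a grammatical word in each.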

    Daniel@FinTOC-2019 Shared Task : TOC Extraction and Title Detection

    We present different methods for the two tasks of the 2019 FinTOC challenge: Title Detection and Table of Contents (TOC) Extraction. For the Title Detection task, we present different approaches using various features: visual characteristics, punctuation density, and character n-grams. Our best approach achieved an official F-measure of 94.88%, ranking sixth on this task. For the TOC Extraction task, we present a method combining visual characteristics of the document layout. With this method we ranked first on this task, with a score of 42.72%.
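
    Two of the Title Detection features named above, punctuation density and character n-grams, are simple to compute. The sketch below shows one plausible definition of each; the function names and the exact definitions (ASCII punctuation share of all characters, overlapping n-gram counts) are assumptions, since the abstract does not say how the authors computed them.

```python
import string
from collections import Counter


def punctuation_density(text):
    """Share of characters that are ASCII punctuation (assumed definition)."""
    if not text:
        return 0.0
    return sum(ch in string.punctuation for ch in text) / len(text)


def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

    A classifier would typically receive these values alongside the visual characteristics; how the features were combined in the submitted systems is not stated in the abstract.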

    Web Page Segmentation for Non Visual Skimming

    Web page segmentation aims to break a page into smaller blocks in which contents with coherent semantics are kept together. Examples of tasks targeted by such a technique are advertisement detection and main content extraction. In this paper, we study different segmentation strategies for the task of non-visual skimming. For that purpose, we consider web page segmentation as a clustering problem over visual elements, where (1) all visual elements must be clustered, (2) a fixed number of clusters must be discovered, and (3) the elements of a cluster should be visually connected. We therefore study three algorithms that comply with these constraints: K-means, F-K-means, and Guided Expansion. Evaluation shows that Guided Expansion yields statistically relevant results in terms of compactness and separateness, and satisfies more logical constraints than the other strategies.
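
    To make the clustering formulation above concrete, here is a minimal sketch of plain K-means applied to the centres of visual elements. It illustrates constraints (1) and (2) only: every element receives a label, and the number of clusters `k` is fixed in advance. Constraint (3), visual connectivity, is only approximated by spatial proximity. The bounding-box format `(x, y, width, height)`, the helper `centre`, and the deterministic initialisation are assumptions, not details from the paper, and F-K-means and Guided Expansion are not shown.

```python
from math import dist


def centre(box):
    """Centre of a hypothetical bounding box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means on 2-D points, seeded with the first k points."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: every element joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels
```

    A usage example would map each element's box through `centre` and pass the resulting points to `kmeans` with the desired number of zones. Seeding from the first `k` points keeps the sketch reproducible; a real system would likely use a more careful initialisation.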

    Concurrent Speech Synthesis to Improve Document First Glance for the Blind

    Skimming and scanning are two well-known reading processes, which are combined to access document content as quickly and efficiently as possible. While both are available in visual reading mode, they are difficult to use in non-visual environments because they rely mainly on typographical and layout properties. In this article, we introduce the concept of tag thunder as a way (1) to achieve the oral transposition of the web 2.0 concept of tag cloud and (2) to produce an innovative interactive stimulus for observing the emergence of self-adapted strategies for non-visual skimming of written texts. We first present our general and theoretical approach to the problem of access to web browsing that is at once fast, global, and non-visual; we then detail the progress of the development and evaluation of the various components that make up our software architecture. We start from the hypothesis that the semantics of the visual architecture of web pages can be transposed into new sensory modalities through three main steps: web page segmentation, keyword extraction, and sound spatialization. We note the difficulty of simultaneously (1) evaluating a modular system as a whole at the end of the processing chain and (2) identifying, at the level of each software component, the exact origin of its limits; despite this issue, the results of the first evaluation campaign seem promising.

    L'analyse automatique de forums de discussion dans un contexte pédagogique [Automatic analysis of discussion forums in an educational context]


    Méthode pour l'analyse automatique de structures formelles sur documents multilingues [Method for the automatic analysis of formal structures in multilingual documents]

    This thesis deals with the automatic parsing of formal structures in written texts. It begins with a presentation of documents in their multilingual dimension and of the necessity of processing them as such. We study their multilingual structure and show how to compute it with the help of a language identification tool. Then, we present an original method for the syntactic parsing of unrestricted French sentences. This method is a generalization and an abstraction of Jacques Vergne's research. The syntactic structures we are interested in are the minimal syntagm and the proposition; both units can be given a multilingual definition, so that the method can be applied to various languages. We propose two processes that build these units. Both processes treat texts as flows and build syntactic structures by propagating relational constraints. As the syntagmatic and propositional structures are interdependent, they are built through the interaction of the two processes, with the second process able to work on partially defined units. We show that both processes are identical if we disregard the nature of the unit they build and the rule base they use. The main thread of this thesis is the method. Each time a structure is computed, we emphasize the method used to obtain it. We show that this method is unique. Each structure is computed with the help of formal and positional clues: these clues come from the study of the units located inside the structure (internal clues) or from the study of the function of the structure in its upper-level unit (external clues).

    Multilingual Sentence Categorization according to Language

    Sentence categorization according to language is a fundamental issue for NLP, especially in document processing. With the growing amount of multilingual text corpus data becoming available, sentence categorization, which yields the multilingual structure of a text, opens up a wide range of applications in multilingual text analysis, such as information retrieval or preprocessing for a multilingual syntactic parser.